Abstract

Living cells are the product of gene expression programs that involve the regulated transcription
of thousands of genes. The elucidation of transcriptional regulatory networks is thus
needed to understand the cell's working mechanism, and can for example be useful for the dis- 
covery of novel therapeutic targets. Although several methods have been proposed to infer gene 
regulatory networks from gene expression data, a recent comparison on a large-scale benchmark 
experiment revealed that most current methods only predict a limited number of known regu- 
lations at a reasonable precision level. We propose SIRENE, a new method for the inference of 
gene regulatory networks from a compendium of expression data. The method decomposes the 
problem of gene regulatory network inference into a large number of local binary classification 
problems that focus on separating target genes from non-targets for each transcription factor (TF).
SIRENE is thus conceptually simple and computationally efficient. We test it on a benchmark
experiment aimed at predicting regulations in E. coli, and show that it retrieves on the order of
6 times more known regulations than other state-of-the-art inference methods.

1 Introduction 

Elucidating the structure of gene regulatory networks is crucial to understand how transcription fac- 
tors (TF) regulate gene expression and allow an organism to regulate its metabolism and adapt itself 
to environmental changes. While high-throughput sequencing and other post-genomics technologies 
offer a wealth of information about individual genes, the experimental characterization of transcrip- 
tional cis-regulation at a genome scale remains a daunting challenge, even for well-studied model 
organisms. In silico methods that attempt to reconstruct such global gene regulatory networks 
from prior biological knowledge and available genomic and post-genomic data therefore constitute 
an interesting direction towards the elucidation of these networks. 

Transcriptional cis-regulation directly influences the level of mRNA transcripts of regulated
genes. Not surprisingly, many in silico methods have been proposed to reconstruct gene regulatory
networks from gene expression data, produced at a fast rate by microarrays (Bansal et al.,
2007). Clustering gene expression profiles across different conditions identifies groups of genes
with similar transcriptomic response, suggesting co-regulation within each group (Tavazoie et al.).
Clustering methods are widely used, computationally efficient, but do not easily lead to the
identification of regulators for a given set of genes. Some authors nonetheless have observed that
identifying similarities, or more generally mutual information, between the expression profiles of a
TF and of a target gene is a good indicator of regulation (Butte et al., 2000; Faith et al., 2007).
When time series of gene expression data are available, other reverse-engineering methodologies can
be applied to capture the interactions governing the observed dynamics. Different mathematical
formalisms have been proposed to model such dynamics, including boolean networks (Akutsu et al.,
2000) or ordinary or stochastic partial differential equations (Chen et al., 1999; Tegner et al.,
2003; Gardner et al., 2003; Chen et al., 2005; di Bernardo et al., 2004; Bansal et al., 2006). Some
authors have also attempted to detect causality relationships between gene expression data, be they
time series or compendia of various experiments, using statistical methods such as Bayesian networks
(Friedman et al., 2000). These methods, which estimate the regulatory network by fitting a dynamic
or statistical model, are often computationally and data demanding.

The comparison of these different approaches and of their capacity to accurately reconstruct
large-scale regulatory networks has been hampered by the difficulty of assembling a realistic set of
biologically validated regulatory relationships and using it as a benchmark to assess the performance
of each method. Recently, Faith et al. (2007) compiled such a benchmark by gathering all known
transcriptional cis-regulation in Escherichia coli and collecting a compendium of several hundred
gene expression profiling experiments. They compared several approaches, including Bayesian
networks (Friedman et al., 2000), ARACNe (Margolin et al., 2006), and the context likelihood of
relatedness (CLR) algorithm, a new method that extends the relevance networks class of algorithms
(Butte et al., 2000). They observed that CLR outperformed all other methods in prediction
accuracy, and experimentally validated some predictions. CLR can therefore be considered as
state-of-the-art among methods that use compendia of gene expression data for large-scale inference of
regulatory networks.

In this paper we present SIRENE (Supervised Inference of REgulatory NEtworks), a new method 
to infer gene regulatory networks on a genome scale from a compendium of gene expression data. 
SIRENE differs fundamentally from other approaches in that it requires as inputs not only gene 
expression data, but also a list of known regulation relationships between TF and target genes. In 
machine learning terminology, the method is supervised in the sense that it uses a partial knowledge 
of the information we want to predict in order to guide the inference engine for the prediction of 
new information. The necessity to input some known regulations is not a serious restriction in 
many applications, as many regulations have already been characterized in model organisms, and 
can be inferred by homology in newly sequenced genomes. Known regulations allow us to use a 
natural induction principle to predict new regulations: if a gene A has an expression profile similar 
to a gene B known to be regulated by a given TF, then gene A is likely to be also regulated by 
the TF. The fact that genes with similar expression profiles are likely to be co-regulated has been 
used for a long time in the construction of groups of genes by unsupervised clustering of expression 
profiles. The novelty in our approach is to use this principle in a supervised classification paradigm. 
This inference paradigm has the advantage that no particular hypothesis is made regarding the 
relationship between the expression data of a TF and those of regulated genes. In fact, expression 
data for the TF are not even needed in our approach. 

Many algorithms for supervised classification can be used to transform this inference principle 
into a working algorithm. We use in our experiments the support vector machine (SVM) algorithm, a 
state-of-the-art method for supervised classification. The idea of casting the problem of gene or protein
network inference as a supervised classification problem, using known interactions as inputs, has
recently been proposed and investigated for the reconstruction of protein-protein interaction (PPI)
and metabolic networks (Yamanishi et al., 2004; Ben-Hur and Noble, 2005). Bleakley et al. (2007)
proposed a simple method where a local model is estimated to predict the interacting partners
of each protein in the network, and all local models are then combined together to predict edges
throughout the network. They showed that this method gave an important improvement in accuracy
compared to more elaborate methods on both the PPI and metabolic networks. Here we adapt this
strategy for the reconstruction of gene regulatory networks. For each TF, we estimate a local model
to discriminate, based on their expression profiles, the genes regulated by the TF from other genes.
All local models are then combined to rank candidate regulatory relationships between TFs and
all genes in the genome. SIRENE is conceptually simple, easy to implement, and computationally
scalable to whole genomes because each local model only involves the training of a supervised
classification algorithm on a few hundred or a few thousand examples.



We test SIRENE on the benchmark experiment proposed by Faith et al. (2007), which aims at
reconstructing known regulations within E. coli genes from a compendium of gene expression data.
On this benchmark, SIRENE strongly outperforms the best results reported by Faith et al. (2007)
with the CLR algorithm. For example, at a precision of 60%, CLR identifies 7.5%
of all known regulatory relationships (recall), while SIRENE has a recall of 44.5% at the same
precision level using expression profiles.



2 System and Methods 
2.1 SIRENE 

SIRENE is a general method to infer new regulation relationships between known TF and all genes 
of an organism. It requires two types of data as inputs. First, each gene in the organism needs to be 
characterized by some data, in our case a vector of expression values in a compendium of expression 
profiles. Second, a list of known regulation relationships between known TF and some genes is 
needed. More precisely, for each TF, we need a list of genes known to be regulated by the TF, and 
if possible a list of genes known not to be regulated by it. Such lists can typically be constructed 
from publicly available databases of experimentally characterized regulation, e.g., RegulonDB for E.
coli genes (Salgado et al., 2006). While such databases usually do not contain information about
the absence of regulation, we discuss in Section 2.3 below how we generate negative examples.

When such data are available, SIRENE splits the problem of regulatory network inference into 
many binary classification subproblems, one subproblem being associated to each TF. More precisely, 
for each TF, SIRENE trains a binary classifier to discriminate between genes known to be regulated 
and genes known not to be regulated by the TF, based on the data that characterize the genes 
(e.g., expression data). The rationale behind this approach is that, although we make no hypothesis 
regarding the relationship between the measured expression level of a TF and its targets, we assume 
that if two genes are regulated by the same TF then they are likely to exhibit similar expression 
patterns. In our implementation, we use a SVM to solve the binary classification problems (Section
2.2), but any other algorithm for supervised binary classification could in principle be used. Once
trained, the model associated to a given TF is able to assign to each new gene, not used during 
training, a score that tends to be positive and large when it believes, based on the data that 
characterize the gene, that the gene is regulated by the TF. The final step is to combine all scores 
of the different models to rank the candidate TF-gene interactions in a unique list by decreasing 
score. 
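
The decomposition described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: the names `centroid_scorer` and `rank_candidates`, the dictionary-based data layout, and the nearest-centroid scorer that stands in for the SVM are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Profile = Sequence[float]
Scorer = Callable[[Profile], float]


def centroid_scorer(positives: List[Profile], negatives: List[Profile]) -> Scorer:
    """Hypothetical stand-in for the SVM: the score is positive when a profile
    is closer to the centroid of the positives than to that of the negatives."""
    def centroid(rows: List[Profile]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a: Profile, b: Profile) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    pos_c, neg_c = centroid(positives), centroid(negatives)
    return lambda x: dist(x, neg_c) - dist(x, pos_c)


def rank_candidates(
    expr: Dict[str, Profile],
    training: Dict[str, Tuple[List[str], List[str]]],
    train: Callable[[List[Profile], List[Profile]], Scorer] = centroid_scorer,
) -> List[Tuple[float, str, str]]:
    """Train one local model per TF, score the genes not used in training,
    and merge all (score, TF, gene) triples into a single ranked list."""
    ranked = []
    for tf, (pos_genes, neg_genes) in training.items():
        model = train([expr[g] for g in pos_genes], [expr[g] for g in neg_genes])
        seen = set(pos_genes) | set(neg_genes)
        for gene, profile in expr.items():
            if gene not in seen:
                ranked.append((model(profile), tf, gene))
    ranked.sort(key=lambda triple: triple[0], reverse=True)
    return ranked
```

Any binary classifier that returns a real-valued score could be passed as `train`; the only requirement of the scheme is that scores from different local models are merged into one list sorted by decreasing score.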

In summary, SIRENE decomposes the difficult problem of gene regulatory network inference 
into a large number of subproblems that attempt to estimate local models to characterize the genes
regulated by each TF. A similar approach was proposed by Bleakley et al. (2007) to infer undirected
graphs, and successfully tested on the reconstruction of metabolic and PPI networks. Here we are 
confronted with a slightly different problem, since the graph we wish to infer is directed and we just 
need to infer local models to predict genes regulated by any given TF. 

2.2 SVM



In our implementation of SIRENE, we use a SVM to train predictors for each local model associated
to a TF. SVM is a popular algorithm to solve general supervised binary classification problems
which is considered state-of-the-art in many applications and is available in many free and public
implementations (Vapnik, 1998; Schölkopf et al., 2004). The basic ingredient of a SVM is a kernel
function K(x, y) between any two genes x and y, that can often be thought of as a measure of
similarity between the genes. In our case, the similarity between genes is measured in terms of
expression profiles. Given a set of n genes $x_1, \ldots, x_n$ that belong to two classes, denoted arbitrarily
$-1$ and $+1$, a SVM estimates a scoring function for any new gene x of the form:

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) .$$

The weights $\alpha_i$ in this expression are optimized by the SVM to enforce as much as possible large
positive scores for genes in the class $+1$ and large negative scores for genes in the class $-1$ in the
training set. A parameter, often called C, controls the possible overfitting to the training
set. The scoring function f(x) can then be used to rank genes with unknown class by decreasing
score, from the most likely to belong to class $+1$ to the most likely to belong to class $-1$.

The kernel K (x, y) defines the similarity measure used by the SVM to build the scoring function. 
In our experiments we want to infer regulations from gene expression data. Each collection of gene 
expression data is a vector, so we simply use the common Gaussian radial basis function kernel 
between vectors u and v: 

$$K(u, v) = \exp\left( - \frac{\| u - v \|^2}{2 \sigma^2} \right) ,$$

where $\sigma > 0$ is the bandwidth parameter of the kernel.
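
As an illustration, the Gaussian kernel and the kernel expansion used for scoring can be written in a few lines. This is a sketch under assumptions: the function names `gaussian_kernel` and `kernel_score` are hypothetical, the bandwidth is called `sigma` in the code, and the weights are supplied by hand rather than optimized by an SVM solver.

```python
import math
from typing import List, Sequence


def gaussian_kernel(u: Sequence[float], v: Sequence[float], sigma: float = 8.0) -> float:
    """Gaussian radial basis function kernel exp(-||u - v||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_score(
    x: Sequence[float],
    training_profiles: List[Sequence[float]],
    weights: List[float],
    sigma: float = 8.0,
) -> float:
    """Scoring function f(x) = sum_i weights[i] * K(x_i, x).
    The weights are assumed to come from a trained SVM; here they are inputs."""
    return sum(
        w * gaussian_kernel(x_i, x, sigma)
        for w, x_i in zip(weights, training_profiles)
    )
```

For example, with two training profiles weighted +1 and -1, a new profile identical to the first one receives a positive score, since its kernel value with the first profile is 1 and its kernel value with the second is smaller.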

Each SVM therefore has two parameters, C and the kernel bandwidth $\sigma$. In order to limit the risk
of overfitting and positive bias in our performance evaluation that could result from an over-optimization
of these parameters on the benchmark data, we simply fix them for all SVM to the unique values
$C = +\infty$ and $\sigma = 8$. The value $C = +\infty$ means that we train hard-margin SVM, which is always
possible with a Gaussian kernel (Vapnik, 1998). The choice $\sigma = 8$ was based on the observation that we
use expression profiles for 445 microarrays scaled to zero mean and unit standard deviation, i.e.,
each gene is represented by a vector of dimension 445 and of length $\sqrt{445} \approx 21$. Hence the distance
between two orthogonal profiles is of the order of $\sqrt{2} \times \sqrt{445} \approx 30$. We expect that a bandwidth
of the order of $\sigma = 8$, which puts two orthogonal profiles at about $4\sigma$ from each other, is a safe
default choice. We performed preliminary experiments with different values of C and $\sigma$, which did
not result in any significant improvement or decrease of performance, suggesting that the behaviour
of SIRENE is robust to variations in its parameters around these default values. All results below
were obtained with this default parameter choice.
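
The orders of magnitude behind the default bandwidth can be checked numerically. The short script below is only a worked example of that arithmetic; the variable names are ours, and it assumes profiles standardized with the population standard deviation, so that the squared norm of a profile equals the number of arrays.

```python
import math

n_arrays = 445   # dimension of each standardized expression profile
bandwidth = 8.0  # default Gaussian kernel bandwidth

# A profile with zero mean and unit standard deviation has squared norm n_arrays.
profile_norm = math.sqrt(n_arrays)

# For orthogonal u and v, ||u - v||^2 = ||u||^2 + ||v||^2 = 2 * n_arrays.
orthogonal_distance = math.sqrt(2 * n_arrays)

# Distance between orthogonal profiles expressed in units of the bandwidth.
distance_in_bandwidths = orthogonal_distance / bandwidth

# Kernel value between two orthogonal profiles.
orthogonal_kernel = math.exp(-orthogonal_distance ** 2 / (2 * bandwidth ** 2))

print(round(profile_norm, 1), round(orthogonal_distance, 1), round(distance_in_bandwidths, 2))
```

The distance between orthogonal profiles comes out slightly below 30, i.e., a little under four bandwidths, and the corresponding kernel value is below 0.001, so unrelated profiles contribute almost nothing to the score of a gene.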



2.3 Choice of negative examples 

SIRENE being a supervised inference algorithm, two sets of positive and negative training examples 
are needed for each SVM. Although regulations reported in databases such as RegulonDB can safely 
be taken as positive training examples, the choice of negative examples is more problematic for two 
reasons. First, little information is published and archived regarding the fact that a given TF is
found not to regulate a given target gene. Hence there is no systematic source of negative examples
for our problem. A natural choice is then to take TF-gene pairs not reported to have regulatory
relationships in databases as negative examples, mixing both true negatives and false negatives. In
that case, we are confronted with the second problem, which is that once a hard-margin SVM is
trained on positive and negative examples, it always predicts significantly negative scores on the
negative examples used during training. As a result it is not possible to use the SVM score on genes used
during training if we want to find TF-gene pairs that were wrongly assigned to the negative class.

To overcome this issue, we propose the following scheme. Let us suppose we want to predict 
whether genes are regulated or not by a given TF. All genes known to be regulated by this TF form
a set of positive examples, and no prediction is needed for them. The other genes are split into 3
subsets of roughly equal size. Then, in turn, each subset is set aside, and a SVM is trained with
all positive examples and all genes in the two other subsets as negative examples. The SIRENE
score for the genes in the held-out subset is the SVM prediction score on these genes, which were
not used during SVM training. Repeating this loop 3 times, we obtain the SIRENE score for all
genes with no known regulation by the TF. This process is then repeated for all other TF one by
one. The advantage of this procedure is that, even though there are false negatives in the training
set of each SVM, the predictions on the genes not used during training can still be positive if some
of these genes look similar to the positive training examples.
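
The three-subset scheme for a single TF can be sketched as follows. The code is an illustration under assumptions rather than the implementation used for the experiments: `scores_for_tf` and `centroid_scorer` are hypothetical names, and a nearest-centroid scorer replaces the hard-margin SVM so that the example stays self-contained.

```python
import random
from typing import Callable, Dict, List, Sequence, Set

Profile = Sequence[float]
Scorer = Callable[[Profile], float]


def centroid_scorer(positives: List[Profile], negatives: List[Profile]) -> Scorer:
    """Hypothetical stand-in for the hard-margin SVM (nearest centroid)."""
    def centroid(rows: List[Profile]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a: Profile, b: Profile) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    pos_c, neg_c = centroid(positives), centroid(negatives)
    return lambda x: dist(x, neg_c) - dist(x, pos_c)


def scores_for_tf(
    expr: Dict[str, Profile],
    targets: Set[str],
    train: Callable[[List[Profile], List[Profile]], Scorer] = centroid_scorer,
    n_subsets: int = 3,
    seed: int = 0,
) -> Dict[str, float]:
    """Score every gene that is not a known target of the TF with a model
    that never saw that gene as a negative training example."""
    others = sorted(g for g in expr if g not in targets)
    random.Random(seed).shuffle(others)
    subsets = [others[k::n_subsets] for k in range(n_subsets)]
    positives = [expr[g] for g in sorted(targets)]
    scores: Dict[str, float] = {}
    for k, held_out in enumerate(subsets):
        # Negatives are the genes of the two subsets that are not held out.
        negatives = [expr[g] for j, s in enumerate(subsets) if j != k for g in s]
        model = train(positives, negatives)
        for gene in held_out:
            scores[gene] = model(expr[gene])
    return scores
```

In this toy setting, a gene whose profile resembles the known targets still receives a high score when it is held out, even though it is treated as a negative example in the two other rounds.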



2.4 CLR 



We compare the performance of SIRENE with CLR, a method for gene network reconstruction from
gene expression data that was shown by Faith et al. (2007) to be state-of-the-art on a large-scale
benchmark evaluation. CLR is an extension of the relevance networks class of algorithms (Butte et al.,
2000), which predict regulations between TF and genes when important mutual information can be
detected. In the case of CLR, an adaptive background correction step is added to the estimation 
of mutual information. For each gene, the statistical likelihood of the mutual information score is 
computed within its network context. Then, for each TF-target gene pair, the mutual information 
score is compared to the context likelihood of both the TF and the target gene, and turned into a 
z-score. Putative TF-gene interactions are then ranked by decreasing z-score. 
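
The background correction can be sketched as follows, starting from a precomputed mutual information matrix. This is only our reading of the procedure and rests on several assumptions: each gene's background is summarized by the mean and standard deviation of its row of mutual information values (diagonal included, for simplicity), negative z-scores are clipped to zero, and the two z-scores of a pair are combined as the square root of the sum of their squares. The function name `clr_scores` is hypothetical, and the estimation of mutual information itself is not shown.

```python
import math
from statistics import mean, pstdev
from typing import List


def clr_scores(mi: List[List[float]]) -> List[List[float]]:
    """Turn a symmetric mutual information matrix into context-corrected scores."""
    n = len(mi)
    background_mean = [mean(row) for row in mi]
    background_std = [pstdev(row) for row in mi]

    def z(value: float, gene: int) -> float:
        # z-score of a mutual information value against one gene's background.
        if background_std[gene] == 0:
            return 0.0
        return max(0.0, (value - background_mean[gene]) / background_std[gene])

    return [
        [math.hypot(z(mi[i][j], i), z(mi[i][j], j)) for j in range(n)]
        for i in range(n)
    ]
```

A pair is thus ranked highly only if its mutual information stands out against the background of at least one of the two genes involved, which penalizes genes that have moderately high mutual information with everything.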



2.5 Experimental protocol 

In order to assess the performance of SIRENE as an inference engine, and compare it with other 
existing methods, we test it on a benchmark of known regulations. However, SIRENE being
a supervised method, we adopt a cross-validation procedure to make sure that its performance
is measured on predictions for genes not used during the model training step. Consequently we adopt
the following 3-fold cross-validation strategy, consistent with the SIRENE protocol to make predictions
explained in Section 2.3. Given a set of TF, a set of genes, and a set of known TF-gene regulations
within these sets, we randomly split the set of genes into 3 parts, train the SVM for each TF on two
of these subsets, and evaluate their prediction quality on the third subset, i.e., on the regulations of
those genes that were not used during training (Figure 1). This process is repeated 3 times, testing
successively on each subset, and the prediction qualities of all folds are averaged. 

Figure 1: Cross validation for the transcriptional regulatory graph

In this cross-validation procedure, particular attention must be paid to the existence of transcription
units and operons in E. coli. Indeed, a given TF typically regulates all genes within an
operon, which moreover usually have very similar expression profiles. As a result, if genes within
an operon are split between a training and a test set, then the SVM prediction is likely to be correct
simply because the SVM will predict that a test gene with a profile very similar to a training gene
should be in the same class. In other words, the SVM can probably easily recognize operons and
make correct predictions due to the presence of operons. However, we are interested here in the
prediction of regulations for new operons. To simulate this problem in our cross-validation
setting, we make sure that all genes that belong to the same operon are in the same subset of
genes, i.e., are always either in the training set or in the test set together. In our experiments below
we report results both in a classical cross-validation setting, and in this particular scheme that
preserves the integrity of operons in the train/test splits.
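
An operon-preserving split can be obtained by assigning whole operons, rather than individual genes, to folds. The following sketch assumes a mapping from each gene to an operon identifier; the function name `operon_folds` and the round-robin assignment are illustrative choices, not details taken from the experiments.

```python
import random
from collections import defaultdict
from typing import Dict, List


def operon_folds(gene_to_operon: Dict[str, str], n_folds: int = 3, seed: int = 0) -> List[List[str]]:
    """Split genes into folds such that an operon is never split across folds."""
    members = defaultdict(list)
    for gene, operon in gene_to_operon.items():
        members[operon].append(gene)

    # Shuffle operons (not genes), then deal them out to the folds in turn.
    operon_ids = sorted(members)
    random.Random(seed).shuffle(operon_ids)
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for k, operon in enumerate(operon_ids):
        folds[k % n_folds].extend(members[operon])
    return folds
```

Dealing out operons in turn balances the number of operons per fold rather than the number of genes, so folds can differ somewhat in size when operons have very different lengths.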

The CLR algorithm is evaluated with the same protocol. However, since CLR is unsupervised, 
the training set is not used in each fold, and the final ROC and precision/recall curves are equiva- 
lently obtained by computing the curves on all genes simultaneously. 

To evaluate the quality of a prediction we rank all possible TF-gene regulations in the test set
by decreasing score, and compute both the receiver operating characteristic (ROC) curve and the
precision/recall (PR) curve. The ROC curve plots the recall, i.e., the fraction of true interactions
that have a score above a threshold, as a function of the false positive rate, i.e., the fraction of
negative interactions that have a score above a threshold, when the threshold varies. The PR curve
plots the precision, i.e., the fraction of true positives among the predictions above a threshold,
as a function of recall, when the threshold varies. One ROC and PR curve is obtained in each fold 
of cross-validation, and these curves are averaged over the three folds to yield the final estimated 
ROC and PR curve. 
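
Both curves follow from a single pass over the ranked list. The sketch below assumes binary labels (1 for a known regulation, 0 otherwise), at least one example of each class, and no tied scores, each prediction being treated as its own threshold; the function name `roc_pr_points` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def roc_pr_points(scores: Sequence[float], labels: Sequence[int]) -> Tuple[List[Point], List[Point]]:
    """Return ROC points (false positive rate, recall) and
    PR points (recall, precision) as the threshold is lowered."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    true_pos = false_pos = 0
    roc: List[Point] = []
    pr: List[Point] = []
    for i in order:
        if labels[i]:
            true_pos += 1
        else:
            false_pos += 1
        recall = true_pos / n_pos
        roc.append((false_pos / n_neg, recall))
        pr.append((recall, true_pos / (true_pos + false_pos)))
    return roc, pr
```

Averaging over folds, as described above, would additionally require interpolating the curves of each fold on a common grid, which is not shown here.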



3 Data 

We used in our experiments the expression and regulation data made publicly available by Faith et al.
(2007) for E. coli, and downloaded from http://gardnerlab.bu.edu/netinfer_plos_2007/?page_id=5 . The
expression data consist of a compendium of 445 E. coli Affymetrix Antisense2 microarray expression
profiles for 4345 genes. The microarrays were collected under different experimental conditions such
as pH changes, growth phases, antibiotics, heat shock, different media, varying oxygen concentrations
and numerous genetic perturbations. The expression data for each gene were normalized
to zero mean and unit standard deviation. The regulation data consist of 3293 experimentally
confirmed regulations between 154 TF and 1211 genes, extracted from the RegulonDB database
(Salgado et al., 2006).



We downloaded the list of 899 known operons in E. coli from RegulonDB. Each operon contains 
one or several genes, and each gene belongs to at most one operon. Genes not present in any of the
RegulonDB operons were considered to form an operon by themselves, resulting in a total of 3360
operons for the 4345 genes. This operon information was used to create the folds in the cross-validation
procedure, as explained in Section 2.5.
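
The completion of the operon list with single-gene operons can be sketched as below. The function name `complete_operon_map` and the `singleton:` prefix are hypothetical; the arithmetic at the end only restates what the reported counts imply.

```python
from typing import Dict, Iterable, List


def complete_operon_map(all_genes: Iterable[str], known_operons: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every gene to an operon identifier; genes absent from the
    known operons become an operon by themselves."""
    gene_to_operon: Dict[str, str] = {}
    for operon_id, genes in known_operons.items():
        for gene in genes:
            gene_to_operon[gene] = operon_id
    for gene in all_genes:
        gene_to_operon.setdefault(gene, "singleton:" + gene)
    return gene_to_operon


# Consistency check on the reported counts: 3360 operons in total, of which
# 899 are known operons, leaves 2461 single-gene operons, so the 899 known
# operons cover the remaining 4345 - 2461 = 1884 genes.
n_singletons = 3360 - 899
n_genes_in_known_operons = 4345 - n_singletons
```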

4 Results



SIRENE was compared to CLR and other algorithms on the E. coli benchmark used by Faith et al. (2007)
and described in the previous section. Figure 2 shows the ROC and PR curves of CLR
and SIRENE. The two curves for the latter, labeled SIRENE and SIRENE-Bias, are respectively
obtained when we use the cross-validation protocol presented in Section 2.5 and when we use a
classical cross-validation scheme where genes within a known operon can be split between training 
and test sets. 

Figure 2: Comparison of performance between CLR and SIRENE. (A) ROC curves, and (B) precision/recall
curves. The SIRENE curve corresponds to the SIRENE algorithm evaluated by 3-fold
cross-validation, when genes within an operon are never split between the training and the test set.
The SIRENE-Bias curve is the same algorithm evaluated by classical 3-fold cross-validation, where
genes are randomly assigned to training and test sets.



CLR scores were obtained directly from Faith et al. (2007). The PR curve of CLR is similar
to that presented by Faith et al. (2007), confirming that we use the exact same benchmark. Both
for ROC and PR, SIRENE performance curves are significantly above CLR. SIRENE-Bias is itself
much better than SIRENE, confirming the importance of the evaluation bias if operons are split
artificially between training and test sets in the cross-validation procedure. In what follows we
restrict ourselves to the analysis of the results of SIRENE in the correct cross-validation protocol.

The PR curve is particularly relevant because the number of true regulations is very small 
compared to the total number of possible TF-gene pairs. We see that the recall obtained by 
SIRENE, i.e., the proportion of known regulations that are correctly predicted, is several times 
larger than the recall of CLR at all levels of precision. More precisely, Table 1 compares the recalls
of SIRENE, CLR and several other methods at 80% and 60% precision. The other methods reported
are relevance networks (Butte et al., 2000), ARACNe (Margolin et al., 2006), and a Bayesian network
(Friedman et al., 2000) implemented by Faith et al. (2007). The performance of these three methods
was taken directly from Faith et al. (2007).

At 60% precision, SIRENE predicts about 6 times more known regulations than CLR, which was the
best among all methods tested on this benchmark by Faith et al. (2007). With 44.5% recall at this
precision level, the performance of SIRENE allows one, in principle, to retrieve almost half of all
known regulations.

Table 1: Recall of different gene regulation prediction algorithms at different levels of precision (60%
and 80%). The values for relevance networks, ARACNe and Bayesian network were taken from
Faith et al. (2007).

Method               Recall at 60% precision   Recall at 80% precision
SIRENE               44.5%                     17.6%
CLR                  7.5%                      5.5%
Relevance networks   4.7%                      3.3%
ARACNe               1%                        0%
Bayesian network     1%                        0%



The main conceptual difference between SIRENE and other methods is that SIRENE is a super- 
vised method that requires known regulations to train its models. As an attempt to understand why 
the performance of SIRENE was better than that of other state-of-the-art unsupervised methods, we 
reasoned that TF with a large number of known regulated target genes could better take advantage 
of the supervised setting, and therefore that predictions for these TF should in general be better 
than predictions for TF with few known targets. To validate this hypothesis, we computed the ROC 
curve for SIRENE by cross-validation, restricted to the prediction of targets for each individual TF 
in turn. For each TF, we then computed the area under the ROC curve (AUC) as an indicator 
of how well the targets of each particular TF are predicted. We did this estimation for both CLR
and SIRENE, and show in Figure 3 the distributions of AUC scores for all TF as a function of the
number of known target genes in RegulonDB. As expected, the values
for SIRENE tend to be larger than those for CLR. More importantly, we observe in the SIRENE
plot a trend to have better AUC values for TF trained on more known targets. This trend is not
present for CLR, which does not benefit from the knowledge of more or fewer targets for each TF.
This result was expected and suggests that, as our knowledge expands and the number of known 
regulations continues to increase, so will the performance of supervised methods like SIRENE. 
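
The per-TF AUC can be computed directly from the scores of the targets and non-targets of each TF, using the fact that the area under the ROC curve equals the probability that a randomly chosen target is scored above a randomly chosen non-target (ties counting for one half). The sketch below is illustrative; the function names and the input layout are assumptions, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Dict, List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve as the fraction of (target, non-target)
    pairs ranked in the right order, ties counting for one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one target and one non-target")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_per_tf(predictions: Dict[str, List[Tuple[float, int]]]) -> Dict[str, float]:
    """predictions maps each TF to a list of (score, is_target) pairs."""
    return {
        tf: auc([s for s, _ in rows], [y for _, y in rows])
        for tf, rows in predictions.items()
    }
```

Plotting the resulting values against the number of known targets of each TF gives the kind of scatter plot discussed above.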




Figure 3: AUC per TF as a function of the number of regulated genes. (A) CLR and (B) SIRENE.



Having validated the relevance and performance of SIRENE on the RegulonDB benchmark, we
performed a global prediction of the E. coli regulatory network at 60% precision in order to predict
new regulations in E. coli. More precisely, for each of the 154 TF with at least one known target
in RegulonDB we computed the SIRENE score for all E. coli genes (4345 in total) that were not
known targets, using the protocol described in Section 2.3. The RegulonDB database contained
3293 known TF-target regulations, so we assigned a score to the $4345 \times 154 - 3293 = 665837$ other
candidate TF-gene pairs. From the cross-validation experiment we calibrated the SIRENE
score threshold associated with various levels of precision. We selected all pairs with a score above
a threshold of $-0.41$, corresponding to an estimated precision of 60%. At this threshold, 991 new
regulations were predicted in addition to the 3293 known ones. Combining known and predicted
regulations we obtained a regulatory network with 4284 edges involving 1688 genes.
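
One simple way to calibrate a score threshold for a target precision from cross-validated predictions is sketched below, followed by a check of the counts quoted above. The calibration rule (the lowest score at which the cumulative precision is still at least the target) and the function name `threshold_for_precision` are assumptions made for illustration; the exact rule used to obtain the reported threshold is not specified here.

```python
from typing import Optional, Sequence


def threshold_for_precision(
    cv_scores: Sequence[float], cv_labels: Sequence[int], target: float = 0.6
) -> Optional[float]:
    """Lowest score threshold such that the predictions scoring at or above it
    have a cross-validated precision of at least `target` (None if unreachable)."""
    ranked = sorted(zip(cv_scores, cv_labels), reverse=True)
    true_pos = 0
    best = None
    for k, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / k >= target:
            best = score
    return best


# Counts quoted in the text.
n_candidate_pairs = 4345 * 154 - 3293   # TF-gene pairs without a known regulation
n_edges = 3293 + 991                    # known plus newly predicted regulations
```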

In order to illustrate some predicted regulations, we focus now on the regulations of TF by other 
TF. Removing all non-TF genes of the predicted network, we obtain a graph with 131 TF and 349 
interactions among them (TF with no interaction were removed). Among them, the rpoD gene, 
which codes for the RNA polymerase sigma factor, accounts alone for 85 regulations. In order to
obtain a picture easier to visualize with the Cytoscape software (Shannon et al., 2003), we removed
rpoD from this graph, and only kept the main connected component, which is shown in Figure 4.
This core regulatory network involves 90 TF, and combines 196 known regulations among them 
with 32 predicted ones. 
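
The extraction of this core network amounts to restricting the graph to TF-TF edges, dropping a hub, and keeping the largest connected component. The following sketch treats edges as undirected when computing components, which is an assumption on our part, and uses the hypothetical function name `main_component`.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple


def main_component(
    edges: Iterable[Tuple[str, str]],
    tfs: Set[str],
    drop: Set[str] = frozenset(),
) -> Set[str]:
    """Largest connected component of the TF-TF subgraph, ignoring edge
    direction, after removing the genes listed in `drop`."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        if src in tfs and dst in tfs and src not in drop and dst not in drop:
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    seen: Set[str] = set()
    best: Set[str] = set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from an unvisited TF.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in neighbours[node] - component:
                component.add(nb)
                queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

Removing a hub before computing components matters because a gene connected to most TF would otherwise merge nearly all of them into one component and clutter the picture.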







Figure 4: Main connected component of the predicted regulatory network among TF of E. coli, at 
an estimated 60% precision level. For clarity, the rpoD gene was removed from this picture.
Grey arrows indicate known regulations, blue arrows indicate new predicted interactions. 



Most regulations in this densely connected region of the E. coli regulatory network have been
investigated in detail, and it is not a surprise that the number of newly predicted regulations is
limited. Still, a quick survey of the literature can confirm some of these predictions. For example,
four new regulators are predicted for yhiW (crp, hns, rpoS and yhiX), together with self-regulation,
and yhiW is in turn predicted to regulate yhiE. Although these regulations were not present in the
database used to train the model, they are confirmed by the literature. The GadW protein coded by
yhiW is a regulator that participates in controlling several genes of the acid resistance system. It is
indeed regulated by the protein coded by yhiX and by the general regulators crp, hns and rpoS that
control resistance to acidity through the gad system, which utilizes two isoforms of glutamate
decarboxylase encoded by gadA and gadB and a putative glutamate:γ-aminobutyric acid antiporter
encoded by gadC (Tucker et al., 2002; Waterman and Small, 2003; Ma et al.). Another predicted regulation
that was confirmed by a literature search is the dependence of hcaR, a TF involved in the oxidative
stress response, on a functional CAP protein encoded by the crp gene (Turlin et al., 2001). Although
preliminary, these first validations confirm the relevance of the approach and may suggest further
experimental validations for subsystems of interest.

SIRENE is easy to implement and scales well to large-scale inference. Indeed, the main idea 
behind SIRENE is to decompose the network inference into a set of local binary classification 
problems, aimed at discriminating targets from non-targets of each TF. Although we used a SVM 
as a basic algorithm to solve these local problems, any algorithm for pattern recognition may be 
used instead. Each local problem involves at most a training set of a few thousand genes, easily
manageable by most machine learning algorithms. This strategy also paves the way to the use of
other genomic data to predict regulation. Indeed, local models for gene classification often improve
in performance when several types of data, such as phylogenetic or subcellular localization information,
are available, and SVM provide a convenient framework to practically perform this data integration
(Lanckriet et al., 2004; Bleakley et al., 2007). Another interesting feature of SIRENE is its ability
to predict self-regulation, which other methods generally have difficulty dealing with.

An important limitation of SIRENE is its inability to predict targets of TF with no a priori
known target. More generally, the performance of SIRENE tends to decrease when few targets are
known. Thus, for example, it cannot be used to discover new transcription factors. An interesting
direction of future research is therefore to extend the predictions to TF with no known target. A 
possible direction may be to combine the supervised approach with other non-supervised approaches 
in some meaningful way. 